Goto

Collaborating Authors

 type inference


Evaluating Program Semantics Reasoning with Type Inference in System F

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.


Automated Type Annotation in Python Using Large Language Models

arXiv.org Artificial Intelligence

Type annotations in Python enhance maintainability and error detection. However, generating these annotations manually is error prone and requires extra effort. Traditional automation approaches like static analysis, machine learning, and deep learning struggle with limited type vocabularies, behavioral over approximation, and reliance on large labeled datasets. In this work, we explore the use of LLMs for generating type annotations in Python. We develop a generate check repair pipeline: the LLM proposes annotations guided by a Concrete Syntax Tree representation, a static type checker (Mypy) verifies them, and any errors are fed back for iterative refinement. We evaluate four LLM variants: GPT 4oMini, GPT 4.1mini (general-purpose), and O3Mini, O4Mini (reasoning optimized), on 6000 code snippets from the ManyTypes4Py benchmark. We first measure the proportion of code snippets annotated by LLMs for which MyPy reported no errors (i.e., consistent results): GPT 4oMini achieved consistency on 65.9% of cases (34.1% inconsistent), while GPT 4.1mini, O3Mini, and O4Mini each reached approximately 88.6% consistency (around 11.4% failures). To measure annotation quality, we then compute exact-match and base-type match accuracies over all 6000 snippets: GPT 4.1mini and O3Mini perform the best, achieving up to 70.5% exact match and 79.1% base type accuracy, requiring under one repair iteration on average. Our results demonstrate that general-purpose and reasoning optimized LLMs, without any task specific fine tuning or additional training can be effective in generating consistent type annotations.They perform competitively with traditional deep learning techniques which require large labeled dataset for training. While our work focuses on Python, the pipeline can be extended to other optionally typed imperative languages like Ruby


Type Prediction With Program Decomposition and Fill-in-the-Type Training

arXiv.org Artificial Intelligence

TypeScript and Python are two programming languages that support optional type annotations, which are useful but tedious to introduce and maintain. This has motivated automated type prediction: given an untyped program, produce a well-typed output program. Large language models (LLMs) are promising for type prediction, but there are challenges: fill-in-the-middle performs poorly, programs may not fit into the context window, generated types may not type check, and it is difficult to measure how well-typed the output program is. We address these challenges by building OpenTau, a search-based approach for type prediction that leverages large language models. We propose a new metric for type prediction quality, give a tree-based program decomposition that searches a space of generated types, and present fill-in-the-type fine-tuning for LLMs. We evaluate our work with a new dataset for TypeScript type prediction, and show that 47.4% of files type check (14.5% absolute improvement) with an overall rate of 3.3 type errors per file. All code, data, and models are available at: https://github.com/GammaTauAI/opentau.


Transformer Models for Type Inference in the Simply Typed Lambda Calculus: A Case Study in Deep Learning for Code

arXiv.org Artificial Intelligence

Despite a growing body of work at the intersection of deep learning and formal languages, there has been relatively little systematic exploration of transformer models for reasoning about typed lambda calculi. This is an interesting area of inquiry for two reasons. First, typed lambda calculi are the lingua franc of programming languages. A set of heuristics that relate various typed lambda calculi to effective neural architectures would provide a systematic method for mapping language features (e.g., polymorphism, subtyping, inheritance, etc.) to architecture choices. Second, transformer models are widely used in deep learning architectures applied to code, but the design and hyperparameter space for them is large and relatively unexplored in programming language applications. Therefore, we suggest a benchmark that allows us to explore exactly this through perhaps the simplest and most fundamental property of a programming language: the relationship between terms and types. Consequently, we begin this inquiry of transformer architectures for typed lambda calculi by exploring the effect of transformer warm-up and optimizer selection in the task of type inference: i.e., predicting the types of lambda calculus terms using only transformers. We find that the optimization landscape is difficult even in this simple setting. One particular experimental finding is that optimization by Adafactor converges much faster compared to the optimization by Adam and RAdam. We conjecture that such different performance of optimizers might be related to the difficulties of generalization over formally generated dataset.


Unveiling Project-Specific Bias in Neural Code Models

arXiv.org Artificial Intelligence

Neural code models have introduced significant improvements over many software analysis tasks like type inference, vulnerability detection, etc. Despite the good performance of such models under the common intra-project independent and identically distributed (IID) training and validation setting, we observe that they usually fail to generalize to real-world inter-project out-of-distribution (OOD) setting. In this work, we show that such phenomenon is caused by model heavily relying on project-specific, ungeneralizable tokens like self-defined variable and function names for downstream prediction, and we formulate it as the project-specific bias learning behavior. We propose a measurement to interpret such behavior, termed as Cond-Idf, which combines co-occurrence probability and inverse document frequency to measure the level of relatedness of token with label and its project-specificness. The approximation indicates that without proper regularization with prior knowledge, model tends to leverage spurious statistical cues for prediction. Equipped with these observations, we propose a bias mitigation mechanism Batch Partition Regularization (BPR) that regularizes model to infer based on proper behavior by leveraging latent logic relations among samples. Experimental results on two deep code benchmarks indicate that BPR can improve both inter-project OOD generalization and adversarial robustness while not sacrificing accuracy on IID data.


Cross-Lingual Adaptation for Type Inference

arXiv.org Artificial Intelligence

Deep learning-based techniques have been widely applied to the program analysis tasks, in fields such as type inference, fault localization, and code summarization. Hitherto deep learning-based software engineering systems rely thoroughly on supervised learning approaches, which require laborious manual effort to collect and label a prohibitively large amount of data. However, most Turing-complete imperative languages share similar control- and data-flow structures, which make it possible to transfer knowledge learned from one language to another. In this paper, we propose cross-lingual adaptation of program analysis, which allows us to leverage prior knowledge learned from the labeled dataset of one language and transfer it to the others. Specifically, we implemented a cross-lingual adaptation framework, PLATO, to transfer a deep learning-based type inference procedure across weakly typed languages, e.g., Python to JavaScript and vice versa. PLATO incorporates a novel joint graph kernelized attention based on abstract syntax tree and control flow graph, and applies anchor word augmentation across different languages. Besides, by leveraging data from strongly typed languages, PLATO improves the perplexity of the backbone cross-programming-language model and the performance of downstream cross-lingual transfer for type inference. Experimental results illustrate that our framework significantly improves the transferability over the baseline method by a large margin.


Objective-Rust

#artificialintelligence

This is going to be another one of those posts where I did something ridiculous and then show you how I got there, so let's just get right to it. Yep, that's Rust code with embedded Objective-C syntax, and it works. Why would you do such a thing? Maybe you want tighter interop between the Rust and Objective-C parts of your iOS app. Maybe you want to write your iOS app entirely in Rust.


๐–ell-๐šปyped ๐‘eflections

#artificialintelligence

As one can observe, the basic MLsub type inference algorithm is actually quite similar to the traditional Algorithm J for HM-style type inference, and is not complicated at all! First, note that type variable bounds, which are updated using mutable variables, may very well form cycles as type inference progresses.


Contrastive Code Representation Learning

arXiv.org Artificial Intelligence

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. This is the first application of instance discrimination to code representation learning to our knowledge. We pre-train models over 1.8m unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.


LambdaNet: Probabilistic Type Inference using Graph Neural Networks

arXiv.org Machine Learning

As gradual typing becomes increasingly popular in languages like Python and TypeScript, there is a growing need to infer type annotations automatically. While type annotations help with tasks like code completion and static error catching, these annotations cannot be fully determined by compilers and are tedious to annotate by hand. This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network. Our approach first uses lightweight source code analysis to generate a program abstraction called a type dependency graph, which links type variables with logical constraints as well as name and usage information. Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions. Our neural architecture can predict both standard types, like number or string, as well as user-defined types that have not been encountered during training. Our experimental results show that our approach outperforms prior work in this space by $14\%$ (absolute) on library types, while having the ability to make type predictions that are out of scope for existing techniques.